Reducing cache misses



- Direct mapped --> there is a direct mapping \_\_\_\_\_\_ from any block address in memory to <u>a single</u>  $\frac{1}{|\mathcal{S}| \leq 1}$  location in the upper level of the hierarchy
- Are there other ways of placing blocks?



#### Set associative- Parking space analogy

- If we have 100 parking spaces we can divide them into 10 sets of 10 spaces each
- You have a choice of any place in your set
- Find your parking place within the set





Similar to 4 parallel DM caches

LW

luf.

Implementation of a 4-way 256-set SA cache addressing one word per block







4-way SA. Shits

- A cache is 4-way set-associative and has 64 KB data. Each block contains 32 bytes. The address is 32 bits wide.
- ——— How many sets does it have?
- What are the sizes of the tag, index, and block offset fields?

Traa 4 way 8-way
Merrory
acces Fully





- # bits in block offset = 5 (since each block contains 2^5 bytes)
- # blocks in cache  $\pm 64 \times 1024 / (32) = (2048)(2^{11})$
- # sets in cache = 2048  $\sqrt{512}$  (2^9) sets  $\sqrt{512}$ 
  - So # bits in index field = 8
- # bits in tag field = 32 5 9 = 18

### Fully associative

64 KB-



# Summary of Loads/Stores





### Summary









- Compare addr with only one tag
- Location X can be in exactly one cache line

- Compare addr with only N tags parallely
- Location X can be in one set, but in any of the N cache lines belonging to that set
- Compare addr with only all tags parallely
- Location X can be in any cache line

 Assume a 4 line DM cache consisting of four oneword blocks. Find the number of misses for the following sequence:

LW \$t0, 0x00

LW \$t0, 0x08

LW \$t0, 0x00

LW \$t0, 0x06

LW \$t0, 0x08





- 0 (0 mod 4) Block 0 Miss 8 - (8 mod 4) - Block 0 - Miss 0 - (0 mod 4) - Block 0 - Miss
- 6 (6 mod 4) Block 2 Miss
- 8 (8 mod 4) Block 0 Miss









 Assume a <u>2-way 2-line SA cache</u> consisting of four one-word blocks. Find the number of misses for:

LW \$t0, 0x00

LW \$t0, 0x08

LW \$t0, 0x00

LW \$t0, 0x06

LW \$t0, 0x08



 $0 - (0 \mod 2) - Set 0 - Hit$ 

6 - (6 mod 2) - Set 0 - Miss

8 - (8 mod 2) - Set 0 - Miss



 Assume a <u>FA cache</u> consisting of four oneword blocks. Find the number of misses for

LW \$t0, 0x00

LW \$t0, 0x08

LW \$t0, 0x00

LW \$t0, 0x06

LW \$t0, 0x08



#### Misses

- Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache.
  - (Misses in infinite cache)

/ 3 c's-

- Capacity—If the cache cannot contain all the blocks needed during execution of a program (Misses due to size of cache)
- Conflict/Collision—If the block-placement strategy is set associative or direct mapped, conflict misses due to too many blocks mapping to same set.
  - (Misses due to associativity and size of cache) (Me)

### Cache read example



### Handling a cache <u>read</u> miss

- Send the original PC value (current PC 4) to the memory.
- Perform a read and wait for the memory to complete its access.
- Write the cache entry
  - Put the data from memory in the data portion of the entry, writing the upper bits of the address into the tag field, and turning the valid bit on.
- Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache.



### Write through

- Assume a store instruction
- Write through always updates both the cache and the next lower level of the memory hierarchy
  - - Causes delays in writing to the slower memory
  - - Performance
  - + Data is always consistent between the two
- Alternate: Use a write buffer to hold the data while data is being written to main memory or next level and then erase it
  - Processor stalls if the buffer is full
  - Works if processor's write generation rate < memory write rate</li>



- Updating values only to the block in the cache
- Write the modified data to the lower level when the block is replaced/evicted.
  - + Performance

- - Use an additional <u>dirty bit</u> to figure out which block was written to in the cache, but not yet written to in





#### Size of Tags versus Set Associativity

- Assuming a cache of 4096 blocks, a 4-word block size, and a 32-bit address, find the total number of tag bits for caches that are <u>Direct</u> <u>mapped</u>.
- Find the total number of sets and tag bits for four-way set associative, and fully associative caches.

#### Direct mapped:

- 16 bytes per block --> 4 bit offset
- 4096 lines --> 12 bit index
- 32 (4 + 12) = 16 bit tag
- No of tag bits = 16 \* 4k = 64k tag bits

#### 2-way SA:

- 16 bytes per block --> 4 bit offset
- 4096 blocks --> gets divided into 2 ways = 2048 sets --> 11 index bits
- 32 (4 + 11) = 17 bit tag
- No of tag bits = 17 \* 2 \* 2048 tag bits

#### 4-way SA:

- 16 bytes per block --> 4 bit offset
- 4096 blocks --> gets divided into <u>4 ways</u> = 1024 sets --> 10 index bits (Index bits have reduced)
- 32 (4+ 10) = 18 bit tag (Tag bits have increased)
- No of tag bits = 18 \* 4 \* 2048 tag bits

#### FA:

- 16 bytes per block --> 4 bit offset
- 4096 blocks --> 1 set of 4096 blocks --> No index bits
- 32 (4) = 28 bit tag (Tag bits have increased)
- No of tag bits = 28\* 1 \* 4096 tag bits





 Show the hits and misses and final cache contents for a <u>fully associative cache with</u> 2-word blocks and a total size of 8 words. Assume LRU replacement.



- Show the hits and misses and final cache contents for a <u>fully</u> <u>associative cache with 2-word blocks</u> and a total size of 8 words. Assume LRU replacement.
  - Assume addresses

4: 00000100, 0: 00000000, 8: 00001000 Since it is a 2-word block, we have to fetch 2 words (2 words = 8 bytes).

- When you get address 4, what will you fetch?
- Divide the address by 8.
- 0/8 and 4/8 yield = 0--> fetch these words
- 4/8 and 8/8 yield 0 and 1 as quotients. So, 4 and 8
   will index into different lines

#### Main memory

| 0                               |
|---------------------------------|
| 1                               |
| 2                               |
| 3                               |
| 4                               |
| 5                               |
| 1<br>2<br>3<br>4<br>5<br>6<br>7 |
| 7                               |
| 8<br>9                          |
| 9                               |
| 10                              |
| 11                              |
| 12                              |
| 13                              |
|                                 |
|                                 |

### Acknowledgements

- MITx- 6.004- Computation Structures-
  - https://www.youtube.com/watch?v=hg0dTS10uSk&list=PLDSlqjcPpoL64CJdF0Qee5oWqGS6we\_Yu&index=13
  - https://www.youtube.com/watch?v=-XqdkA931ag&list=PLDSlqjcPpoL64CJdF0Qee5oWqGS6we\_Yu&index=14
- http://home.ku.edu.tr/comp303/public\_html/ Lecture15.pdf

### Acknowledgements

- UCB- CS 252
  - Course page: http://inst.eecs.berkeley.edu/~cs152/sp18/
  - Memory:
     http://inst.eecs.berkeley.edu/~cs152/sp18/lectures/L 05-Memory.pdf
  - http://inst.eecs.berkeley.edu/~cs152/sp18/lectures/L 06-MemoryII.pdf

- Hennessey and Patterson
- Tech Review pages